Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

!!! TASK: Split dimensionspacepoints into separate table to reduce data duplication #4790

Merged
merged 12 commits into from
Mar 18, 2024

Conversation

kitsunet
Copy link
Member

@kitsunet kitsunet commented Dec 1, 2023

This should reduce stored data amounts, also make queries and workspace copies faster.
The hierarchy relations are now entirely limited size field types and should not
need external lookup like "text" fields do.

Note that this needs a ./flow cr:setup and ./flow cr:projectionreplay --projection contentGraph

@github-actions github-actions bot added the 9.0 label Dec 1, 2023
@kitsunet kitsunet force-pushed the task/split-dimensionspacepoints branch from 038f0df to 4274928 Compare March 12, 2024 22:50
@kitsunet kitsunet marked this pull request as ready for review March 12, 2024 22:51
@kitsunet kitsunet force-pushed the task/split-dimensionspacepoints branch 2 times, most recently from 2239ab0 to 6bef019 Compare March 12, 2024 23:06
This should reduce stored data amounts, also make queries and workspace copies faster.
The hierarchy relations are now entirely limited size field types and should not
need external lookup like "text" fields do.
@kitsunet kitsunet force-pushed the task/split-dimensionspacepoints branch from 6bef019 to 3cf14ed Compare March 12, 2024 23:16
Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good (I was just skimming over it on my phone), thanks for taking care!
Apart from the inline comments I wonder:
We assume, that we have a performance issue and that it is solved with this PR. I have no reason to doubt this, but did you measure that by any chance?

@kitsunet
Copy link
Member Author

Will take care of the rest, thank you!

But here is one example outcome on my local system, 3 times replay contentGraph on 9.0 and then again 3 times on this branch (with cr:setup before ofc)
Screenshot 2024-03-13 at 10 25 25
Screenshot 2024-03-13 at 10 25 13

That seems like a nice improvement for what is essentially the demo site with a few edits. Data size I cannot verify yet, but it's tricky because what is shown in clients and information_schema is not necessarily the full picture. Currently trying to get sensible numbers.

Copy link
Member

@bwaidelich bwaidelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great already, just one more question/comment.

But here is one example outcome on my local system

Impressive, almost 30% speed improvement \o/
@kitsunet thanks for going the extra mile and providing a basic benchmark.

Data size I cannot verify yet

I don't think, that's that important really.. it will most probably be in the same region

@@ -67,8 +67,8 @@ public function hierarchyIntegrityIsProvided(): Result
}

$invalidlyHashedHierarchyRelationRecords = $this->client->getConnection()->executeQuery(
'SELECT * FROM ' . $this->tableNamePrefix . '_hierarchyrelation
WHERE dimensionspacepointhash != MD5(dimensionspacepoint)'
'SELECT * FROM ' . $this->tableNamePrefix . '_hierarchyrelation h LEFT JOIN ' . $this->tableNamePrefix . '_dimensionspacepoints dsp ON dsp.hash = h.dimensionspacepointhash
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what the purpose for this check is. But after it's not fully the same as before. So we just check if the entry for a given hash is present in _dimensionspacepoints. But before we also checked if the md5-hashed value is equal to the hash.

Copy link
Member Author

@kitsunet kitsunet Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmm, we can't know if the hash is the right one though, either it exists in the DSP table or it's invalid, if it exists in there IMHO we can assume it matches the dimension coordinates. We could add a separate check that reads all DSP entries and compares hash and getting the hash from the coordinates in the same row if we really want to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with keeping it as it is. Just wanted to point out this difference.

Copy link
Contributor

@dlubitz dlubitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me by reading!

@mhsdesign mhsdesign marked this pull request as draft March 15, 2024 14:07
@mhsdesign
Copy link
Member

Bastians pr should go first ;)

@kitsunet kitsunet marked this pull request as ready for review March 15, 2024 17:44
Copy link
Member

@mhsdesign mhsdesign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow thank you ^^
Interestingly the speed gain is enormous with my 270 events of the neos demo + some extra stuff

Command Before (average of 3) After (average of 3)
time ./flow cr:projectionReplayAll --force ~21s ~1.5s
time ./flow cr:projectionReplay contentGraph --force ~29s ~5s

(the reason why there is a difference between all and one is some odd db lock, and also somehow this change avoids running into the lock state for a projectionReplayAll (previously the db had 95% cpu intensity))

Comment on lines +68 to +72
$coordinates = $this->getCoordinatesByHashFromRuntimeCache($hash);
if ($coordinates === null) {
$this->fillRuntimeCacheFromDatabase();
$coordinates = $this->getCoordinatesByHashFromRuntimeCache($hash);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha this is a bit hacky as we dont pass this repository as singleton that means it cannot share its runtime cache and would otherwise not pick up on new OriginDimensionSpacePoints, but i consider this a very safe runtime cache as nothing can go wrong (despite it being outdated, in which case we refetch)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, we could even make this static if we wanted LOL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah its fine thx, i guess asking the db like this is good enough and it will also work for async / parallel

@kitsunet
Copy link
Member Author

The reading side now shares one instance and the writing side uses the repository transactional with the respective other entry written (NodeRecord / HierachyRelation)

Comment on lines +143 to +150
$tableNamePrefix,
$nodeAggregateId,
$originDimensionSpacePoint,
$originDimensionSpacePointHash,
$properties,
$nodeTypeName,
$classification,
$timestamps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i love php

Copy link
Member

@mhsdesign mhsdesign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement, its a nice bike.

@mhsdesign mhsdesign mentioned this pull request Mar 18, 2024
35 tasks
@kitsunet kitsunet merged commit c3a81af into neos:9.0 Mar 18, 2024
9 checks passed
@mhsdesign
Copy link
Member

A just a noob question @kitsunet

Why don’t we join but handle that separately? I guess performance and simplicity?

@mhsdesign mhsdesign changed the title Split dimensionspacepoints into separate table to reduce data duplication !!! TASK: Split dimensionspacepoints into separate table to reduce data duplication Mar 30, 2024
mhsdesign added a commit to mhsdesign/neos-development-collection that referenced this pull request Apr 24, 2024
…onspacepoints"

This reverts commit c3a81af, reversing
changes made to 117c052.
Comment on lines 68 to 70
return $table
->addIndex(['childnodeanchor'])
->addIndex(['contentstreamid'])
->addIndex(['parentnodeanchor'])
->addIndex(['contentstreamid', 'childnodeanchor', 'dimensionspacepointhash'])
->setPrimaryKey(['childnodeanchor', 'contentstreamid', 'dimensionspacepointhash', 'parentnodeanchor'])
->addIndex(['contentstreamid', 'dimensionspacepointhash']);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol after debugging i found out that this change here slows down the replay of ContentStreamWasForked events dramatically

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm, this fixes replay for me:
image

I would suggest that we revert that part for now as a "hotfix". Afterwards, we could investigate why that kills the database

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #5009

bwaidelich added a commit that referenced this pull request Apr 24, 2024
…acepoints"

This reverts commit c3a81af, reversing
changes made to 117c052.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants